Abstract

We  compare  and  contrast  two  different  models  for 
detecting  sentence-like  units  in  continuous  speech. 
The  first  approach  uses  hidden  Markov  sequence 
models  based  on  N-grams  and  maximum  likeli¬ 
hood  estimation,  and  employs  model  interpolation 
to  combine  different  representations  of  the  data. 
The  second  approach  models  the  posterior  proba¬ 
bilities  of  the  target  classes;  it  is  discriminative  and 
integrates  multiple  knowledge  sources  in  the  max¬ 
imum  entropy  (maxent)  framework.  Both  models 
combine  lexical,  syntactic,  and  prosodic  informa¬ 
tion.  We  develop  a  technique  for  integrating  pre¬ 
trained  probability  models  into  the  maxent  frame¬ 
work,  and  show  that  this  approach  can  improve 
on  an  HMM-based  state-of-the-art  system  for  the 
sentence-boundary  detection  task.  An  even  more 
substantial  improvement  is  obtained  by  combining 
the  posterior  probabilities  of  the  two  systems. 

1  Introduction 

Sentence  boundary  detection  is  a  problem  that  has 
received  limited  attention  in  the  text-based  com¬ 
putational  linguistics  community  (Schmid,  2000; 
Palmer  and  Hearst,  1994;  Reynar  and  Ratnaparkhi, 
1997),  but  which  has  recently  acquired  renewed  im¬ 
portance  through  an  effort  by  the  DARPA  EARS 
program  (DARPA  Information  Processing  Technol¬ 
ogy  Office,  2003)  to  improve  automatic  speech  tran¬ 
scription  technology.  Since  standard  speech  recog¬ 
nizers  output  an  unstructured  stream  of  words,  im¬ 
proving  transcription  means  not  only  that  word  ac¬ 
curacy  must  be  improved,  but  also  that  commonly 
used  structural  features  such  as  sentence  boundaries 
need  to  be  recognized.  The  task  is  thus  fundamen¬ 
tally  based  on  both  acoustic  and  textual  (via  auto¬ 
matic  word  recognition)  information.  From  a  com¬ 
putational  linguistics  point  of  view,  sentence  units 
are crucial and assumed in most of the further processing
steps that one would want to apply to such
output:  tagging  and  parsing,  information  extraction, 
and  summarization,  among  others. 


Sentence  segmentation  from  speech  is  a  difficult 
problem.  The  best  systems  benchmarked  in  a  re¬ 
cent  government-administered  evaluation  yield  er¬ 
ror  rates  between  30%  and  50%,  depending  on  the 
genre  of  speech  processed  (measured  as  the  num¬ 
ber  of  missed  and  inserted  sentence  boundaries  as 
a  percentage  of  true  sentence  boundaries).  Because 
of  the  difficulty  of  the  task,  which  leaves  plenty  of 
room  for  improvement,  its  relevance  to  real-world 
applications,  and  the  range  of  potential  knowledge 
sources  to  be  modeled  (acoustics  and  text-based, 
lower-  and  higher-level),  this  is  an  interesting  chal¬ 
lenge  problem  for  statistical  and  computational  ap¬ 
proaches. 

All  of  the  systems  participating  in  the  recent 
DARPA  RT-03F  Metadata  Extraction  evaluation 
(National  Institute  of  Standards  and  Technology, 
2003)  were  based  on  a  hidden  Markov  model  frame¬ 
work,  in  which  word/tag  sequences  are  modeled  by 
N-gram language models (LMs). Additional features
(mostly reflecting speech prosody) are modeled
as observation likelihoods attached to the N-gram
states of the HMM (Shriberg et al., 2000). The
HMM  is  a  generative  modeling  approach,  since  it 
describes  a  stochastic  process  with  hidden  variables 
(the  locations  of  sentence  boundaries)  that  produces 
the  observable  data.  The  segmentation  is  inferred 
by  comparing  the  likelihoods  of  different  boundary 
hypotheses. 

While  the  HMM  approach  is  computationally  ef¬ 
ficient  and  (as  described  later)  provides  a  convenient 
way  for  modularizing  the  knowledge  sources,  it  has 
two  main  drawbacks:  First,  the  standard  training 
methods  for  HMMs  maximize  the  joint  probability 
of  observed  and  hidden  events,  as  opposed  to  the 
posterior  probability  of  the  correct  hidden  variable 
assignment  given  the  observations.  The  latter  is  a 
criterion  more  closely  related  to  classification  error. 
Second,  the  N-gram  LM  underlying  the  HMM  tran¬ 
sition  model  makes  it  difficult  to  use  features  that 
are highly correlated (such as word and POS labels)
without greatly increasing the number of model parameters;
this in turn would make robust estimation difficult.

Figure 1: Diagram of the sentence segmentation task.

In  this  paper,  we  describe  our  effort  to  overcome 
these  shortcomings  by  1)  replacing  the  generative 
model  with  one  that  estimates  the  posterior  proba¬ 
bilities  directly,  and  2)  using  the  maximum  entropy 
(maxent)  framework  to  estimate  conditional  distri¬ 
butions,  giving  us  a  more  principled  way  to  com¬ 
bine  a  large  number  of  overlapping  features.  Both 
techniques  have  been  used  previously  for  traditional 
NLP  tasks,  but  they  are  not  straightforward  to  ap¬ 
ply  in  our  case  because  of  the  diverse  nature  of  the 
knowledge  sources  used  in  sentence  segmentation. 
We  describe  the  techniques  we  developed  to  work 
around  these  difficulties,  and  compare  classification 
accuracy  of  the  old  and  new  approach  on  different 
genres  of  speech.  We  also  investigate  how  word 
recognition  error  affects  that  comparison.  Finally, 
we  show  that  a  simple  combination  of  the  two  ap¬ 
proaches  turns  out  to  be  highly  effective  in  improv¬ 
ing  the  best  previous  results  obtained  on  a  bench¬ 
mark  task. 

2  The  Sentence  Segmentation  Task 

The  sentence  boundary  detection  problem  is  de¬ 
picted  in  Figure  1  in  the  source-channel  framework. 
The  speaker  intends  to  say  something,  chooses  the 
word  string,  and  imposes  prosodic  cues  (duration, 
emphasis, intonation, etc.). This signal goes through
the  speech  production  channel  to  generate  an  acous¬ 
tic  signal.  A  speech  recognizer  determines  the  most 
likely  word  string  given  this  signal.  To  detect  pos¬ 
sible  sentence  boundaries  in  the  recognized  word 
string,  prosodic  features  are  extracted  from  the  sig¬ 
nal,  and  combined  with  textual  cues  obtained  from 
the  word  string.  At  issue  in  this  paper  is  the  final 
box  in  the  diagram:  how  to  model  and  combine  the 
available  knowledge  sources  to  find  the  most  accu¬ 
rate  hypotheses. 

Note  that  this  problem  differs  from  the  sen¬ 
tence  boundary  detection  problem  for  written  text  in 
the  natural  language  processing  literature  (Schmid, 
2000; Palmer and Hearst, 1994; Reynar and Ratnaparkhi,
1997). Here we are dealing with spoken language;
therefore, there is no punctuation information,
the words are not capitalized, and the
transcripts  from  the  recognition  output  are  errorful. 
This  lack  of  textual  cues  is  partly  compensated  by 
prosodic  information  (timing,  pitch,  and  energy  pat¬ 
terns)  conveyed  by  speech.  Also  note  that  in  spon¬ 
taneous  conversational  speech  “sentence”  is  not  al¬ 
ways  a  straightforward  notion.  For  our  purposes  we 
use  the  definition  of  a  “sentence-like  unit”,  or  SU, 
as  defined  by  the  LDC  for  labeling  and  evaluation 
purposes (Strassel, 2003).

The  training  data  has  SU  boundaries  marked  by 
annotators,  based  on  both  the  recorded  speech  and 
its  transcription.  In  testing,  a  system  has  to  recover 
both the words and the locations of sentence boundaries,
denoted by $(W, E) = w_1 e_1 w_2 \ldots w_i e_i \ldots w_n$,
where  W  represents  the  strings  of  word  tokens  and 
E  the  inter-word  boundary  events  (sentence  bound¬ 
ary  or  no  boundary). 

The  system  output  is  scored  by  first  finding  a  min¬ 
imum  edit  distance  alignment  between  the  hypothe¬ 
sized  word  string  and  the  reference,  and  then  com¬ 
paring  the  aligned  event  labels.  The  SU  error  rate  is 
defined  as  the  total  number  of  deleted  or  inserted  SU 
boundary  events,  divided  by  the  number  of  true  SU 
boundaries.1  For  diagnostic  purposes  a  secondary 
evaluation  condition  allows  use  of  the  correct  word 
transcripts.  This  condition  allows  us  to  study  the 
segmentation  task  without  the  confounding  effect  of 
speech  recognition  errors,  using  perfect  lexical  in¬ 
formation. 
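
The scoring rule above can be illustrated with a minimal Python sketch. It assumes the hypothesized and reference word strings have already been aligned, so that each inter-word boundary carries one reference label and one hypothesis label; the function name `su_error_rate` and the boolean encoding are illustrative choices, not part of the official scoring tools.

```python
def su_error_rate(reference, hypothesis):
    """Return the SU error rate in percent for two aligned label sequences.

    Each sequence holds one boolean per inter-word boundary
    (True = SU boundary). Alignment is assumed to be done already.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must be aligned and equal in length")
    # A deletion is a true boundary that the system missed.
    deletions = sum(1 for r, h in zip(reference, hypothesis) if r and not h)
    # An insertion is a hypothesized boundary with no true counterpart.
    insertions = sum(1 for r, h in zip(reference, hypothesis) if h and not r)
    true_boundaries = sum(1 for r in reference if r)
    if true_boundaries == 0:
        raise ValueError("reference contains no SU boundaries")
    # The denominator counts only true SU boundaries, not all boundaries.
    return 100.0 * (deletions + insertions) / true_boundaries


reference = [False, True, False, False, True, False, True]
hypothesis = [False, True, True, False, False, False, True]
print(su_error_rate(reference, hypothesis))
```

In this toy example there is one missed and one inserted boundary against three true boundaries, so two errors out of seven positions already give an error rate of about 67%.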

3  Features  and  Knowledge  Sources 

Words  and  sentence  boundaries  are  mutually  con¬ 
strained  via  syntactic  structure.  Therefore,  the  word 
identities  themselves  (from  automatic  recognition 
or  human  transcripts)  constitute  a  primary  knowl¬ 
edge  source  for  the  sentence  segmentation  task.  We 
also  make  use  of  various  automatic  taggers  that  map 
the  word  sequence  to  other  representations.  The 
TnT  tagger  (Brants,  2000)  is  used  to  obtain  part-of- 
speech  (POS)  tags.  A  TBL  chunker  trained  on  Wall 
Street  Journal  corpus  (Ngai  and  Florian,  2001)  maps 
each  word  to  an  associated  chunk  tag,  encoding 
chunk  type  and  relative  word  position  (beginning  of 
an  NP,  inside  a  VP,  etc.).  The  tagged  versions  of 
the  word  stream  are  provided  to  allow  generaliza¬ 
tions  based  on  syntactic  structure  and  to  smooth  out 
possibly undertrained word-based probability estimates.
For the same reasons we also generate word
class labels that are automatically induced from bigram
word distributions (Brown et al., 1992).

1 This is the same as simple per-event classification accuracy,
except that the denominator counts only the “marked”
events, thereby yielding error rates that are much higher than
if one uses all potential boundary locations.

To  model  the  prosodic  structure  of  sentence 
boundaries,  we  extract  several  hundred  features 
around  each  word  boundary.  These  are  based  on  the 
acoustic  alignments  produced  by  a  speech  recog¬ 
nizer  (or  forced  alignments  of  the  true  words  when 
given).  The  features  capture  duration,  pitch,  and 
energy  patterns  associated  with  the  word  bound¬ 
aries.  Informative  features  include  the  pause  du¬ 
ration  at  the  boundary,  the  difference  in  pitch  be¬ 
fore  and  after  the  boundary,  and  so  on.  A  cru¬ 
cial  aspect  of  many  of  these  features  is  that  they 
are  highly  correlated  (e.g.,  by  being  derived  from 
the  same  raw  measurements  via  different  normaliza¬ 
tions),  real-valued  (not  discrete),  and  possibly  unde¬ 
fined  (e.g.,  unvoiced  speech  regions  have  no  pitch). 
These  properties  make  prosodic  features  difficult  to 
model directly in either of the approaches we are examining
in the paper. Hence, we have resorted to a
modular  approach:  the  information  from  prosodic 
features  is  modeled  separately  by  a  decision  tree 
classifier  that  outputs  posterior  probability  estimates 
$P(e_i \mid f_i)$, where $e_i$ is the boundary event after $w_i$,
and $f_i$ is the prosodic feature vector associated with
the  word  boundary.  Conveniently,  this  approach 
also  permits  us  to  include  some  non-prosodic  fea¬ 
tures  that  are  highly  relevant  for  the  task,  but  not 
otherwise  represented,  such  as  whether  a  speaker 
(turn)  change  occurred  at  the  location  in  question.2 

A  practical  issue  that  greatly  influences  model  de¬ 
sign  is  that  not  all  information  sources  are  avail¬ 
able  uniformly  for  all  training  data.  For  example, 
prosodic modeling assumes acoustic data, whereas
word-based models can be trained on text-only data,
which  is  usually  available  in  much  larger  quantities. 
This  poses  a  problem  for  approaches  that  model  all 
relevant  information  jointly  and  is  another  strong 
motivation  for  modular  approaches. 

4  The  Models 

4.1  Hidden  Markov  Model  for  Segmentation 

Our  baseline  model,  and  the  one  that  forms  the  ba¬ 
sis  of  much  of  the  prior  work  on  acoustic  sentence 
segmentation  (Shriberg  et  al.,  2000;  Gotoh  and  Re- 
nals,  2000;  Christensen,  2001;  Kim  and  Woodland, 
2001),  is  a  hidden  Markov  model.  The  states  of 
the model correspond to words $w_i$ and following
event labels $e_i$. The observations associated with
the states are the words, as well as other (mainly
prosodic) features $f_i$. Figure 2 shows a graphical
model representation of the variables involved.
Note that the words appear in both the states and the
observations, such that the word stream constrains
the possible hidden states to matching words; the
ambiguity in the task stems entirely from the choice
of events.

2 Here we are glossing over some details on prosodic modeling
that are orthogonal to the discussion in this paper. For
example, instead of simple decision trees we actually use ensemble
bagging to reduce the variance of the classifier (Liu et
al., 2004).

Figure 2: The graphical model for the SU detection
problem. Only one word+event is depicted in each state,
but in a model based on N-grams the previous $N-1$
tokens would condition the transition to the next state.

4.1.1  Classification 

Standard  algorithms  are  available  to  extract  the  most 
probable  state  (and  thus  event)  sequence  given  a  set 
of  observations.  The  error  metric  is  based  on  clas¬ 
sification  of  individual  word  boundaries.  Therefore, 
rather  than  finding  the  highest  probability  sequence 
of  events,  we  identify  the  events  with  highest  poste¬ 
rior  individually  at  each  boundary  i: 

$$\hat{e}_i = \arg\max_{e_i} P(e_i \mid W, F) \qquad (1)$$

where  W  and  F  are  the  words  and  features  for 
the  entire  test  sequence,  respectively.  The  individ¬ 
ual  event  posteriors  are  obtained  by  applying  the 
forward-backward  algorithm  for  HMMs  (Rabiner 
and  Juang,  1986). 
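
As a rough illustration of Equation (1), the sketch below runs the forward-backward algorithm on a deliberately simplified model and then picks the highest-posterior label at each boundary. It is not the system described here: the hidden state is reduced to the event label alone (a first-order chain rather than a word+event N-gram), and all names and probabilities (`event_posteriors`, `init`, `trans`, `obs`) are hypothetical.

```python
def event_posteriors(init, trans, obs):
    """Forward-backward over event labels.

    init[s]     : P(e_1 = s)
    trans[r][s] : P(e_i = s | e_{i-1} = r)
    obs[i][s]   : observation score for label s at boundary i
    Returns one dict per boundary with P(e_i = s | all observations).
    """
    labels = list(init)
    n = len(obs)
    if n == 0:
        return []
    # Forward pass, renormalized at every step to avoid underflow.
    fwd = []
    for i in range(n):
        row = {}
        for s in labels:
            if i == 0:
                prior = init[s]
            else:
                prior = sum(fwd[i - 1][r] * trans[r][s] for r in labels)
            row[s] = prior * obs[i][s]
        total = sum(row.values())
        fwd.append({s: v / total for s, v in row.items()})
    # Backward pass, also renormalized.
    bwd = [None] * n
    bwd[n - 1] = {s: 1.0 for s in labels}
    for i in range(n - 2, -1, -1):
        row = {
            r: sum(trans[r][s] * obs[i + 1][s] * bwd[i + 1][s] for s in labels)
            for r in labels
        }
        total = sum(row.values())
        bwd[i] = {r: v / total for r, v in row.items()}
    # The posterior is proportional to the forward-backward product.
    posteriors = []
    for i in range(n):
        row = {s: fwd[i][s] * bwd[i][s] for s in labels}
        total = sum(row.values())
        posteriors.append({s: v / total for s, v in row.items()})
    return posteriors


# Illustrative numbers only.
init = {"SU": 0.1, "NO": 0.9}
trans = {"SU": {"SU": 0.05, "NO": 0.95}, "NO": {"SU": 0.15, "NO": 0.85}}
obs = [{"SU": 0.5, "NO": 1.0}, {"SU": 4.0, "NO": 0.6}, {"SU": 0.3, "NO": 1.1}]

posteriors = event_posteriors(init, trans, obs)
# Per-boundary decision, as in Equation (1).
decisions = [max(p, key=p.get) for p in posteriors]
print(decisions)
```

Because the scaling constants at each position do not depend on the label, renormalizing the forward and backward values leaves the per-boundary posteriors unchanged.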

4.1.2  Model  Estimation 

Training  of  the  HMM  is  supervised  since  event- 
labeled  data  is  available.  There  are  two  sets  of  pa¬ 
rameters  to  estimate.  The  state  transition  proba¬ 
bilities  are  estimated  using  a  hidden  event  N-gram 
LM  (Stolcke  and  Shriberg,  1996).  The  LM  is 
obtained  with  standard  N-gram  estimation  meth¬ 
ods  from  data  that  contains  the  word+event  tags  in 
sequence: $w_1, e_1, w_2, \ldots, e_{n-1}, w_n$. The resulting
LM  can  then  compute  the  required  HMM  transition 


probabilities as3

$$P(w_i e_i \mid w_1 e_1 \ldots w_{i-1} e_{i-1}) = P(w_i \mid w_1 e_1 \ldots w_{i-1} e_{i-1}) \times P(e_i \mid w_1 e_1 \ldots w_{i-1} e_{i-1} w_i)$$

The  N-gram  estimator  maximizes  the  joint 
word+event  sequence  likelihood  P(W,E)  on  the 
training  data  (modulo  smoothing),  and  does  not 
guarantee  that  the  correct  event  posteriors  needed 
for classification according to Equation (1) are
maximized. 

The second set of HMM parameters consists of the observation
likelihoods $P(f_i \mid e_i, w_i)$. Instead of training
a likelihood model we make use of the prosodic
classifiers described in Section 3. We have at our
disposal decision trees that estimate $P(e_i \mid f_i)$. If
we further assume that prosodic features are independent
of words given the event type (a reasonable
simplification if features are chosen appropriately),
observation likelihoods may be obtained by

$$P(f_i \mid w_i, e_i) = \frac{P(e_i \mid f_i)}{P(e_i)} \, P(f_i) \qquad (2)$$

Since $P(f_i)$ is constant we can ignore it when carrying
out the maximization (1).
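
Equation (2) amounts to dividing each classifier posterior by the prior of its event label. A minimal sketch, with hypothetical names and made-up numbers, is:

```python
def scaled_likelihoods(tree_posterior, event_prior):
    """Return P(e | f) / P(e) for each label e.

    By Bayes' rule this ratio equals P(f | e) / P(f), so it is
    proportional to the observation likelihood needed by the HMM.
    """
    return {
        label: tree_posterior[label] / event_prior[label]
        for label in tree_posterior
    }


# Illustrative values: the prosodic classifier gives a 40% chance of an SU
# at a boundary where SUs have a 10% prior.
scores = scaled_likelihoods({"SU": 0.4, "NO": 0.6}, {"SU": 0.1, "NO": 0.9})
print(scores)
```

A posterior that exceeds the prior yields a score above 1 and thus favors that label, even when the posterior itself is below 0.5.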

4.1.3  Knowledge  Combination 

The  HMM  structure  makes  strong  independence  as¬ 
sumptions:  (1)  that  features  depend  only  on  the  cur¬ 
rent  state  (and  in  practice,  as  we  saw,  only  on  the 
event label) and (2) that each word+event label depends
only on the last $N-1$ tokens. In return, we
get  a  computationally  efficient  structure  that  allows 
information  from  the  entire  sequence  W,  F  to  in¬ 
form  the  posterior  probabilities  needed  for  classifi¬ 
cation,  via  the  forward-backward  algorithm. 

More  problematic  in  practice  is  the  integration 
of  multiple  word-level  features,  such  as  POS  tags 
and  chunker  output.  Theoretically,  all  tags  could 
simply  be  included  in  the  hidden  state  representa¬ 
tion  to  allow  joint  modeling  of  words,  tags,  and 
events.  However,  this  would  drastically  increase  the 
size  of  the  state  space,  making  robust  model  estima¬ 
tion  with  standard  N-gram  techniques  difficult.  A 
method that works well in practice is linear interpolation,
whereby the conditional probability estimates
of various models are simply averaged, thus
reducing  variance.  In  our  case,  we  obtain  good  re¬ 
sults  by  interpolating  a  word-N-gram  model  with 

3  To  utilize  word+event  contexts  of  length  greater  than  one 
we  have  to  employ  HMMs  of  order  2  or  greater,  or  equivalently, 
make  the  entire  word+event  N-gram  be  the  state. 


one  based  on  automatically  induced  word  classes 
(Brown  et  al.,  1992). 

Similarly,  we  can  interpolate  LMs  trained  from 
different  corpora.  This  is  usually  more  effective 
than  pooling  the  training  data  because  it  allows  con¬ 
trol  over  the  contributions  of  the  different  sources. 
For  example,  we  have  a  small  corpus  of  training  data 
labeled  precisely  to  the  LDC’s  SU  specifications, 
but  a  much  larger  (130M  word)  corpus  of  standard 
broadcast news transcripts with punctuation, from
which  an  approximate  version  of  SUs  could  be  in¬ 
ferred.  The  larger  corpus  should  get  a  larger  weight 
on  account  of  its  size,  but  a  lower  weight  given  the 
mismatch  of  the  SU  labels.  By  tuning  the  interpola¬ 
tion  weight  of  the  two  LMs  empirically  (using  held- 
out  data)  the  right  compromise  was  found. 
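
The interpolation step can be sketched as follows. The text does not state which criterion was used to tune the weight, so the choice of held-out log-likelihood below is an assumption, as are the function names and the candidate grid.

```python
import math


def interpolate(p_small, p_large, weight):
    """Linearly interpolate two conditional probability estimates."""
    return weight * p_small + (1.0 - weight) * p_large


def tune_weight(heldout, candidates):
    """Pick the weight that maximizes held-out log-likelihood.

    heldout holds (p_small, p_large) pairs: the probability each LM
    assigns to an event that was actually observed in held-out data.
    All probabilities are assumed to be strictly positive.
    """
    def log_likelihood(weight):
        return sum(math.log(interpolate(a, b, weight)) for a, b in heldout)

    return max(candidates, key=log_likelihood)


# Illustrative held-out probabilities from a small matched LM and a
# large mismatched LM.
heldout = [(0.5, 0.1), (0.4, 0.2), (0.05, 0.3)]
candidates = [i / 10.0 for i in range(11)]
best_weight = tune_weight(heldout, candidates)
print(best_weight)
```

Tuning on held-out data lets the weight reflect both corpus size and label mismatch without having to model either explicitly.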

4.2  Maxent  Posterior  Probability  Model 

As  observed,  HMM  training  does  not  maximize  the 
posterior  probabilities  of  the  correct  labels.  This 
mismatch  between  training  and  use  of  the  model 
as  a  classifier  would  not  arise  if  the  model  directly 
estimated  the  posterior  boundary  label  probabilities 
P(ei\W,F).  A  second  problem  with  HMMs  is  that 
the  underlying  N-gram  sequence  model  does  not 
cope  well  with  multiple  representations  (features)  of 
the  word  sequence  (words,  POS,  etc.)  short  of  build¬ 
ing  a  joint  model  of  all  variables.  This  type  of  sit¬ 
uation  is  well-suited  to  a  maximum  entropy  formu¬ 
lation  (Berger  et  al.,  1996),  which  allows  condition¬ 
ing  features  to  apply  simultaneously,  and  therefore 
gives  greater  freedom  in  choosing  representations. 
Another  desirable  characteristic  of  maxent  models 
is  that  they  do  not  split  the  data  recursively  to  condi¬ 
tion  their  probability  estimates,  which  makes  them 
more  robust  than  decision  trees  when  training  data 
is  limited. 

4.2.1  Model  Formulation  and  Training 

We  built  a  posterior  probability  model  for  sentence 
boundary  classification  in  the  maxent  framework. 
Such a model takes the familiar exponential form4

$$P(e \mid W, F) = \frac{1}{Z_\lambda(W, F)} \, e^{\sum_k \lambda_k g_k(e, W, F)} \qquad (3)$$

where $Z_\lambda(W, F)$ is the normalization term:

$$Z_\lambda(W, F) = \sum_{e'} e^{\sum_k \lambda_k g_k(e', W, F)} \qquad (4)$$

The functions $g_k(e, W, F)$ are indicator functions
corresponding to (complex) features defined over
events, words, and prosodic features. For example,
one such feature function might be:

$$g(e, W, F) = \begin{cases} 1 & \text{if } w_i = \text{uhhuh and } e = \text{SU} \\ 0 & \text{otherwise} \end{cases}$$

4 We omit the index i from e here since the “current” event
is meant in all cases.

The maxent model is estimated by finding the parameters
$\lambda_k$ such that the expected values of the various
feature functions $E_p[g_k(e, W, F)]$ match the
empirical averages in the training data. It can be
shown that the resulting model has maximal entropy
among all the distributions satisfying these expectation
constraints. At the same time, the parameters
so chosen maximize the conditional likelihood
$\prod_i P(e_i \mid W, F)$ over the training data, subject to the
constraints of the exponential form given by Equation
(3).5 The conditional likelihood is closely related
to the individual event posteriors used for classification,
meaning that this type of model explicitly
optimizes discrimination of correct from incorrect
labels.
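
To make the exponential form concrete, the sketch below evaluates the posterior for one boundary given a set of already-trained weights. The feature names and weight values are invented for illustration, and training (for example with L-BFGS) is not shown.

```python
import math


def maxent_posterior(active_features, weights, labels):
    """Evaluate the maxent posterior for a single boundary.

    active_features[e] lists the indicator features that fire (value 1)
    when the candidate label is e; features with no weight count as 0.
    """
    scores = {
        e: math.exp(sum(weights.get(name, 0.0) for name in active_features[e]))
        for e in labels
    }
    z = sum(scores.values())  # normalization term over candidate labels
    return {e: score / z for e, score in scores.items()}


# Hypothetical weights for three indicator features.
weights = {
    "word=uhhuh&e=SU": 1.2,
    "turn_change&e=SU": 0.8,
    "word=uhhuh&e=NO": -0.3,
}
active = {
    "SU": ["word=uhhuh&e=SU", "turn_change&e=SU"],
    "NO": ["word=uhhuh&e=NO"],
}
print(maxent_posterior(active, weights, ["SU", "NO"]))
```

Overlapping features simply add their weights in the exponent, which is why correlated knowledge sources can be used side by side without building a joint model.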

4.2.2  Choice  of  Features 

Even  though  the  mathematical  formulation  gives  us 
the  freedom  to  use  features  that  are  overlapping  or 
otherwise  dependent,  we  still  have  to  choose  a  sub¬ 
set  that  is  informative  and  parsimonious,  so  as  to 
give  good  generalization  and  robust  parameter  es¬ 
timates.  Various  feature  selection  algorithms  for 
maxent  models  have  been  proposed,  e.g.,  (Berger  et 
al.,  1996).  However,  since  computational  efficiency 
was  not  an  issue  in  our  experiments,  we  included  all 
features  that  corresponded  to  information  available 
to  our  baseline  approach,  as  listed  below.  We  did 
eliminate  features  that  were  triggered  only  once  in 
the  training  set  to  improve  robustness  and  to  avoid 
overconstraining  the  model. 

•  Word  N-grams.  We  use  combinations 
of preceding and following words to encode
the word context of the event, e.g.,
$\langle w_i \rangle$, $\langle w_{i+1} \rangle$, $\langle w_i, w_{i+1} \rangle$, $\langle w_{i-1}, w_i \rangle$,
and $\langle w_i, w_{i+1}, w_{i+2} \rangle$,
where $w_i$ refers to the word before the boundary
of interest.

•  POS  N-grams.  POS  tags  are  the  same  as  used 
for  the  HMM  approach.  The  features  capturing 
POS  context  are  similar  to  those  based  on  word 
tokens. 

• Chunker tags. These are used similarly to POS
and word features, except we use tags encoding
chunk type (NP, VP, etc.) and word position
within the chunk (beginning versus inside).6

5 In our experiments we used the L-BFGS parameter estimation
method, with Gaussian-prior smoothing (Chen and Rosenfeld,
1999) to avoid overfitting.

•  Word  classes.  These  are  similar  to  N-gram  pat¬ 
terns  but  over  automatically  induced  classes. 

•  Turn  flags.  Since  speaker  change  often  marks 
an  SU  boundary,  we  use  this  binary  feature. 
Note  that  in  the  HMM  approach  this  feature 
had  to  be  grouped  with  the  prosodic  features 
and  handled  by  the  decision  tree.  In  the  max¬ 
ent  approach  we  can  use  it  separately. 

•  Prosody.  As  we  described  earlier,  decision  tree 
classifiers are used to generate the posterior
probabilities $P(e_i \mid f_i)$. Since the maxent classifier
is most conveniently used with binary features,
we encode the prosodic posteriors into
several binary features via thresholding. Equation
(3) shows that the presence of each feature
in a maxent model has a monotonic effect
on the final probability (raising or lowering it
by a constant factor $e^{\lambda_k g_k}$). This suggests encoding
the decision tree posteriors in a cumulative
fashion through a series of binary features,
for example, $p > 0.1$, $p > 0.3$, $p > 0.5$,
$p > 0.7$, $p > 0.9$. This representation is also
more  robust  to  mismatch  between  the  posterior 
probability  in  training  and  test  set,  since  small 
changes  in  the  posterior  value  affect  at  most 
one  feature. 

Note  that  the  maxent  framework  does  allow  the 
use  of  real-valued  feature  functions,  but  pre¬ 
liminary  experiments  have  shown  no  gain  com¬ 
pared  to  the  binary  features  constructed  as  de¬ 
scribed  above.  Still,  this  is  a  topic  for  future 
research. 

•  Auxiliary  LM.  As  mentioned  earlier,  additional 
text-only  language  model  training  data  is  of¬ 
ten  available.  In  the  HMM  model  we  incor¬ 
porated  auxiliary  LMs  by  interpolation,  which 
is  not  possible  here  since  there  is  no  LM  per¬ 
se,  but  rather  N-gram  features.  However,  we 
can  use  the  same  trick  as  we  used  for  prosodic 
features.  A  word-only  HMM  is  used  to  esti¬ 
mate  posterior  event  probabilities  according  to 
the auxiliary LM, and these posteriors are then
thresholded to yield binary features.

• Combined features. To date we have not fully
investigated compound features that combine
different knowledge sources and are able to
model the interaction between them explicitly.
We only included a limited set of such features,
for example, a combination of the decision tree
hypothesis and POS contexts.

6 Chunker features were only used for broadcast news
data, due to the poor chunking performance on conversational
speech.
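
Two of the feature types listed above lend themselves to a short sketch: the word N-gram context templates and the cumulative thresholding of a posterior. The feature-name strings, the `<pad>` token for positions outside the utterance, and the function names are illustrative assumptions; only the templates and thresholds come from the text.

```python
THRESHOLDS = (0.1, 0.3, 0.5, 0.7, 0.9)


def cumulative_features(posterior, prefix="prosody", thresholds=THRESHOLDS):
    """Encode a posterior as binary features: one per threshold exceeded."""
    return [f"{prefix}>{t}" for t in thresholds if posterior > t]


def word_context_features(words, i):
    """Word N-gram features for the boundary that follows words[i]."""
    def w(j):
        # Positions outside the utterance map to a padding token.
        return words[j] if 0 <= j < len(words) else "<pad>"

    return [
        f"w0={w(i)}",
        f"w+1={w(i + 1)}",
        f"w0,w+1={w(i)},{w(i + 1)}",
        f"w-1,w0={w(i - 1)},{w(i)}",
        f"w0,w+1,w+2={w(i)},{w(i + 1)},{w(i + 2)}",
    ]


print(cumulative_features(0.62))
print(word_context_features(["uhhuh", "i", "see"], 0))
```

With the cumulative encoding, moving a posterior from 0.69 to 0.71 switches on exactly one additional feature, which is the robustness property noted for the prosody features. The same encoder could be applied to the auxiliary LM posteriors by passing a different `prefix`.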

4.3  Differences  Between  HMM  and  Maxent 

We  have  already  discussed  the  differences  between 
the  two  approaches  regarding  the  training  objective 
function  (joint  likelihood  versus  conditional  likeli¬ 
hood)  and  with  respect  to  the  handling  of  depen¬ 
dent  word  features  (model  interpolation  versus  in¬ 
tegrated  modeling  via  maxent).  On  both  counts  the 
maxent  classifier  should  be  superior  to  the  HMM. 
However,  the  maxent  approach  also  has  some  the¬ 
oretical  disadvantages  compared  to  the  HMM  by 
design.  One  obvious  shortcoming  is  that  some  in¬ 
formation  gets  lost  in  the  thresholding  that  converts 
posterior  probabilities  from  the  prosodic  model  and 
the  auxiliary  LM  into  binary  features. 

A  more  qualitative  limitation  of  the  maxent 
model  is  that  it  only  uses  local  evidence  (the  sur¬ 
rounding  word  context  and  the  local  prosodic  fea¬ 
tures).  In  that  respect,  the  maxent  model  resem¬ 
bles  the  conditional  probability  model  at  the  in¬ 
dividual  HMM  states.  The  HMM  as  a  whole, 
however,  through  the  forward-backward  procedure, 
propagates evidence from all parts of the observation
sequence to any given decision point. Variants
such as the conditional Markov model (CMM) combine
sequence modeling with posterior probability
(e.g., maxent) modeling, but it has been shown that
CMMs are still structurally inferior to HMMs because
they only propagate evidence forward in time,
not backwards (Klein and Manning, 2002).

5  Results  and  Discussion 

5.1  Experimental  Setup 

Experiments  comparing  the  two  modeling  ap¬ 
proaches  were  conducted  on  two  corpora:  broad¬ 
cast  news  (BN)  and  conversational  telephone  speech 
(CTS).  BN  and  CTS  differ  in  genre  and  speaking 
style. These differences are reflected in the frequency
of SU boundaries: about 14% of inter-word
boundaries  are  SUs  in  CTS,  compared  to  roughly 
8%  in  BN. 

The  corpora  are  annotated  by  LDC  according  to 
the  guidelines  of  (Strassel,  2003).  Training  and  test 
data are those used in the DARPA Rich Transcription
Fall 2003 evaluation.7 For CTS, there are about
40 hours of conversational data from the Switchboard
corpus for training and 6 hours (72 conversations)
for testing. The BN data has about 20 hours
of broadcast news shows in the training set and 3
hours (6 shows) in the test set. The SU detection
task is evaluated on both the reference transcriptions
(REF) and speech recognition outputs (STT). The
speech recognition output is obtained from the SRI
recognizer (Stolcke et al., 2003).

7 We used both the development set and the evaluation set
as the test set in this paper, in order to have a larger test set to
make the results more meaningful.

             HMM     Maxent   Combined
BN    REF    48.72   48.61    46.79
      STT    55.37   56.51    54.35
CTS   REF    31.51   30.66    29.30
      STT    42.97   43.02    41.88

Table 1: SU detection results (error rate in %) using
maxent and HMM individually and in combination on
BN and CTS.

System  performance  is  evaluated  using  the  offi¬ 
cial  NIST  evaluation  tools,8  which  implement  the 
metric  described  earlier.  In  our  experiments,  we 
compare  how  the  two  approaches  perform  individ¬ 
ually  and  in  combination.  The  combined  classifier 
is  obtained  by  simply  averaging  the  posterior  esti¬ 
mates  from  the  two  models,  and  then  picking  the 
event  type  with  the  highest  probability  at  each  posi¬ 
tion. 
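
The combination rule is simple enough to state in a few lines. The sketch below assumes each model supplies one posterior distribution per boundary as a dictionary over event labels; the names and numbers are illustrative.

```python
def combine_posteriors(hmm_posteriors, maxent_posteriors):
    """Average two models' posteriors and pick the best label per boundary."""
    decisions = []
    for hmm, maxent in zip(hmm_posteriors, maxent_posteriors):
        # Equal-weight average of the two posterior estimates.
        average = {label: 0.5 * (hmm[label] + maxent[label]) for label in hmm}
        decisions.append(max(average, key=average.get))
    return decisions


hmm_posteriors = [{"SU": 0.7, "NO": 0.3}, {"SU": 0.4, "NO": 0.6}]
maxent_posteriors = [{"SU": 0.2, "NO": 0.8}, {"SU": 0.9, "NO": 0.1}]
print(combine_posteriors(hmm_posteriors, maxent_posteriors))
```

In this toy case the two models disagree at both boundaries, and the average sides with whichever model is more confident.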

We  also  investigate  other  experimental  factors, 
such  as  the  impact  of  the  speech  recognition  errors, 
the  impact  of  genre,  and  the  contribution  of  text  ver¬ 
sus  prosodic  information  in  each  model. 

5.2  Experimental  Results 

Table  1  shows  SU  detection  results  for  BN  and 
CTS,  using  both  reference  transcriptions  and  speech 
recognition  output,  using  the  HMM  and  the  max¬ 
ent  approach  individually  and  in  combination.  The 
maxent  approach  slightly  outperforms  the  HMM  ap¬ 
proach  when  evaluating  on  the  reference  transcripts, 
and  the  combination  of  the  two  approaches  achieves 
the  best  performance  for  all  tasks  (significant  at 
p  <  0.05  using  the  sign  test  on  the  reference  tran¬ 
scription  condition,  mixed  results  on  using  recogni¬ 
tion  output). 

5.2.1  BN  vs.  CTS 

The  detection  error  rate  on  CTS  is  lower  than  on 
BN.  This  may  be  due  to  the  metric  used  for  per¬ 
formance.  Detection  error  rate  is  measured  as  the 
percentage  of  errors  per  reference  SU.  The  number 
of  SUs  in  CTS  is  much  larger  than  for  BN,  making 
the  relative  error  rate  lower  for  the  conversational 
speech  task.  Notice  also  from  Table  1  that  maxent 
yields  more  gain  on  CTS  than  on  BN  (for  the  refer¬ 
ence  transcription  condition  on  both  corpora).  One 
possible reason for this is that we have more training
data and thus less of a sparse data problem for
CTS.

8 http://www.nist.gov/speech/tests/rt/rt2003/fall/

                Del     Ins     Total
BN    HMM       28.48   20.24   48.72
      Maxent    32.06   16.54   48.61
CTS   HMM       17.19   14.32   31.51
      Maxent    19.97   10.69   30.66

Table 2: Error rates for the two approaches on reference
transcriptions. Performance is shown in deletion, insertion,
and total error rate (%).

                               BN      CTS
HMM      Textual               67.48   38.92
         Textual + prosody     48.72   31.51
Maxent   Textual               63.56   36.32
         Textual + prosody     48.61   30.66

Table 3: SU detection error rate (%) using different
knowledge sources, for BN and CTS, evaluated on the
reference transcription.
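
The effect of the metric's denominator on the BN versus CTS comparison can be checked with back-of-the-envelope arithmetic. The sketch below rescales the SU error rate to a per-boundary rate using the approximate SU frequencies quoted for the two corpora (about 14% for CTS, roughly 8% for BN) and the HMM error rates on reference transcripts; since those frequencies are rounded, the results are only indicative.

```python
def per_boundary_error(su_error_rate_pct, su_fraction):
    """Rescale errors per true SU to errors per inter-word boundary.

    errors / boundaries = (errors / SUs) * (SUs / boundaries)
    """
    return su_error_rate_pct * su_fraction


cts = per_boundary_error(31.51, 0.14)  # HMM, CTS, reference transcripts
bn = per_boundary_error(48.72, 0.08)   # HMM, BN, reference transcripts
print(round(cts, 2), round(bn, 2))
```

Under these assumptions the per-boundary error comes out slightly higher for CTS than for BN, which is consistent with the suggestion that the lower CTS error rate partly reflects how the metric is normalized.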

5.2.2  Error  Type  Analysis 

Table 2 shows error rates for the HMM and the maxent
approaches in the reference condition. Because the
maxent approach depends less on the prosody model,
its errors differ from those of the HMM approach:
it makes more deletion errors and fewer insertion
errors, since the prosody model tends to overgenerate
SU hypotheses. The different error patterns suggest
that we can effectively combine the system output
from the two approaches. As shown in Table 1, the
combination of maxent and HMM consistently yields
the best performance.
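One simple way to combine the two systems is to interpolate their per-boundary posterior probabilities. The sketch below assumes a linear interpolation with a single weight and a fixed decision threshold; both values, and the function name, are illustrative assumptions rather than the settings used in the experiments.

```python
def combine_posteriors(p_hmm, p_maxent, weight=0.5, threshold=0.5):
    """Linearly interpolate two lists of SU posteriors.

    p_hmm and p_maxent hold P(SU | boundary) from each system for the
    same sequence of word boundaries. `weight` is the share given to
    the HMM (assumed value); `threshold` is the decision cutoff
    (assumed value).
    """
    if len(p_hmm) != len(p_maxent):
        raise ValueError("both systems must score the same boundaries")
    combined = [weight * a + (1.0 - weight) * b
                for a, b in zip(p_hmm, p_maxent)]
    decisions = [p >= threshold for p in combined]
    return combined, decisions


# Hypothetical posteriors for two word boundaries.
combined, decisions = combine_posteriors([0.9, 0.2], [0.5, 0.4])
```

A system that tends to insert and one that tends to delete can partly cancel each other's errors under such a scheme, which is the intuition behind the complementary error patterns noted above.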

5.2.3  Contribution  of  Knowledge  Sources 

Table 3 shows SU detection results for the two approaches,
using textual information only, as well as
in combination with the prosody model (which are
the same results as shown in Table 1). We only report
the results on the reference transcription condition,
in order not to confound the comparison with
word recognition errors.

The maxent model's superior results for text-only
classification are consistent with its ability to
combine overlapping word-level features in a principled
way. However, the HMM largely catches
up once prosodic information is added. This can
be attributed to the lossless integration of prosodic
posteriors in the HMM, as well as the fact that in
the HMM each boundary decision is affected by
prosodic information throughout the data, whereas
the maxent model only uses the prosodic features at
the boundary to be classified.


5.2.4  Effect  of  Recognition  Errors 

We  observe  in  Table  1  that  there  is  a  large  increase  in 
error  rate  when  evaluating  on  the  speech  recognition 
output. This happens in part because word information
is inaccurate in the recognition output, which affects
the LMs and lexical features. The prosody
model  is  also  affected,  since  the  alignment  of  incor¬ 
rect  words  to  the  speech  is  imperfect,  thereby  affect¬ 
ing  the  prosodic  feature  extraction.  However,  the 
prosody  model  is  more  robust  to  recognition  errors 
than  the  LMs,  due  to  its  lesser  dependence  on  word 
identity.  The  degradation  on  CTS  is  larger  than  on 
BN.  This  can  easily  be  explained  by  the  difference 
in  word  error  rates,  22.9%  on  CTS  and  12.1%  on 
BN. 

The maxent system degrades more than the
HMM system when errorful recognition output is
used. In light of the previous section, this makes
sense: most of the improvement of the maxent
model comes from better lexical feature modeling,
and these are exactly the features that deteriorate
most with faulty recognition output.

6  Conclusions  and  Future  Work 

We  have  described  two  different  approaches  for 
modeling  and  integration  of  diverse  knowledge 
sources  for  automatic  sentence  segmentation  from 
speech: a state-of-the-art approach based on
HMMs, and an alternative approach based on posterior
probability estimation via maximum entropy.
To  achieve  competitive  performance  with  the  max¬ 
ent  model  we  devised  a  cumulative  binary  coding 
scheme  to  map  posterior  estimates  from  auxiliary 
submodels  into  features  for  the  maxent  model. 
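The idea of a cumulative binary coding can be illustrated as follows. This is a sketch under stated assumptions: the threshold values, the feature-name prefix, and the function name are hypothetical choices for illustration, not the configuration used in our system.

```python
def cumulative_binary_features(posterior: float,
                               thresholds=(0.1, 0.3, 0.5, 0.7, 0.9),
                               prefix="prosody_p"):
    """Map a submodel posterior to cumulative binary features.

    Every threshold that the posterior exceeds fires its own feature,
    so a higher posterior switches on a superset of the features that
    a lower posterior switches on.
    """
    if not 0.0 <= posterior <= 1.0:
        raise ValueError("posterior must lie in [0, 1]")
    return {f"{prefix}>{t}": int(posterior > t) for t in thresholds}


# Hypothetical prosody-model posterior of 0.6 for one word boundary.
features = cumulative_binary_features(0.6)
```

Unlike a one-of-N binning, in which exactly one bin feature fires, the cumulative coding preserves the ordering of the posterior values: nearby posteriors share most of their active features.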

The  two  approaches  have  complementary 
strengths  and  weaknesses  that  were  reflected  in  the 
results,  consistent  with  the  findings  for  text-based 
NLP  tasks  (Klein  and  Manning,  2002).  The  maxent 
model  showed  much  better  accuracy  than  the  HMM 
with  lexical  information,  and  a  smaller  win  after 
combination  with  prosodic  features.  The  HMM 
made  more  effective  use  of  prosodic  information 
and  degraded  less  with  errorful  word  recognition. 
An interpolation of posterior probabilities from the
two  systems  achieved  2-7%  relative  error  reduction 
compared  to  the  baseline  (significant  at  p  <  0.05 
for  the  reference  transcription  condition).  The 
results  were  consistent  for  two  different  genres  of 
speech. 

In  future  work  we  hope  to  determine  how  the  in¬ 
dividual  qualitative  differences  of  the  two  models 
(estimation  methods,  model  structure,  etc.)  con¬ 
tribute  to  the  observed  differences  in  results.  To 
improve results overall, we plan to explore features
that combine multiple knowledge sources, as well
as approaches that model recognition uncertainty in
order to mitigate the effects of word errors. We also
plan to investigate conditional random field
(CRF) models. CRFs combine the advantages of
both  the  HMM  and  the  maxent  approaches,  being 
a  discriminatively  trained  model  that  can  incorpo¬ 
rate  overlapping  features  (the  maxent  advantages), 
while  also  modeling  sequence  dependencies  (an  ad¬ 
vantage  of  HMMs)  (Lafferty  et  al.,  2001). 